Introduction

In this data analysis project we explore the dataset from the House Prices: Advanced Regression Techniques competition held on Kaggle. The dataset describes residential homes in Ames, Iowa. The goal of is project (and the Kaggle competition) is to predict the sale price of a home based on it’s other attributes. Theses attributes are represented in the dataset by an extensive 79 variables, including everything from the number of bathrooms, to the material the roof is made out of, to the year the home was built.

Our primary goal is to build a model which can predict house prices as accurately as possible. We want our model to work well with previously unseen examples. So to avoid overfitting, we will use cross validation accuracy to evaluate our candidate models.

As a secondary goal, we’d like our model to be stable and interperatable so we will also consider BIC, and Adjusted \(R^2\) when selecting a best model.


Methods

Data Preparation

First, we load the training data and extract the number of samples and the feature names:

train_data = read.csv('data/train.csv')
features = setdiff(colnames(train_data), c("Id", "SalePrice"))

The housing dataset includes 79 features and 1460 samples.

Dealing with Missing Predictor Values

Naturally, we expect that some examples are missing features, let’s try removing those samples and see how much data we’d have left to work with:

train_data_omit_na = na.omit(train_data)
nrow(train_data_omit_na)
## [1] 0

Bummer! It appears every sample has is missing at least one feature. So we can’t simply omit them or we’ll have nothing left to work with! For factor variables with missing values, we can simply create a new other category to fill in for NA values.

other_category = 'other'
for (f in features) {
  if (is.factor(train_data[[f]]) && any(is.na(train_data[[f]]))) {
    levels(train_data[[f]]) = c(levels(train_data[[f]]), other_category)
    train_data[[f]][is.na(train_data[[f]])] = other_category
  }
}

# Try omitting samples with missing features again:
train_data_omit_na = na.omit(train_data)
nrow(train_data_omit_na)
## [1] 1121

This appears to solve most of the problem, but we’re still losing 339 samples, since they have missing numeric feature values. Since they’re a large portion (23% of the dataset), we don’t want to ignore them. Unfortunately, finding suitable replacement values for numeric features is more difficult.

Let’s see which features have missing values:

Filter(function (f) any(is.na(train_data[[f]])), features)
## [1] "LotFrontage" "MasVnrArea"  "GarageYrBlt"

Great, we only need to deal with missing values for 3 numeric features. Let’s replace the NA values of these features with sensible defaults.

For MasVnrArea, the masonry veneer area in square feet, we have a large number of examples with a value of 0, so that should be a decent replacement. Since it doesn’t seem especially important, we might also consider dropping the feature entirely instead of the samples.

train_data$MasVnrArea[is.na(train_data$MasVnrArea)] = 0

Since GarageYrBlt and YearBuilt are highly correlated (0.8235195) and they’re identical 60% of the time, we’ll simply replace missing GarageYrBlt values with YearBuilt values.

train_data$GarageYrBlt[is.na(train_data$GarageYrBlt)] = train_data$YearBuilt[is.na(train_data$GarageYrBlt)]

Coming up with replacement values for LotFrontage (linear feet of street connected to property) is much more difficult. It seems too important of a feature to simply remove. There are also no cases where the value is 0, and so no obvious replacement value.

There are a few variables like LotArea, LotShape and LotConfig that seem like they might be predictive of LotFrontage. Let’s quickly build a model to predict LotFrontage and then use it to fill in our missing values.

lot_frontage_model = lm(LotFrontage ~ log(LotArea) + LotArea:LotConfig + LotShape + LandSlope + YearBuilt + BldgType, data = train_data_omit_na)

We can be confident the lot_frontage_model above is doing a good job predicting LotFrontage because:

  • The p-value of the significance of regression test is extremely small.
  • The adjusted \(R^2\) is fairly large: 0.583933
  • We’re using a small number of predictors without higher degree polynomials so we aren’t especially concerned with overfitting.

If we wanted to be even more confident, we might check the cross validated RMSE, but because we’re only using this model to fill in missing values, we won’t spend too much time on it.

Now we can fillin the missing LotFrontage values:

train_data$LotFrontage[is.na(train_data$LotFrontage)] = predict(lot_frontage_model, newdata = train_data[is.na(train_data$LotFrontage),])

# Check if there are any remaining NA values in the training data:
any(is.na(train_data))
## [1] FALSE

Success! We’ve managed to replace all of the missing values without losing any training samples.

Creating Additional Factor Variables

When we loaded the dataset, R decided which predictors were numeric and which were factors. Some predictors that R considers numeric, might work better as factor variables. For example, the price difference between a 0 and a 2 car garage is likely different than the price difference between a 2 and a 4 car garage.

for (f in features) {
  # We'll add any numeric predictor with less than 12 unique values as a factor variable
  if (!is.factor(train_data[[f]]) && length(unique(train_data[[f]])) <= 12) {
    train_data[[paste(f, 'Fctr', sep = "")]] = as.factor(train_data[[f]])
  }
}

# Update `features`:
features = setdiff(colnames(train_data), c("Id", "SalePrice"))

Great, we’ve added 14 new factor variables for a total of 93 features.

Removing Categories with Few Samples

// TODO: explain this

category_cutoff = nrow(train_data) * 0.05
for (f in features) {
  if (is.factor(train_data[[f]])) {
    for (category in levels(train_data[[f]])) {
      samples_in_cat = sum(na.omit(train_data[[f]]) == category)
      if (samples_in_cat < category_cutoff) {
        levels(train_data[[f]]) = c(levels(train_data[[f]]), other_category)
        train_data[[f]][train_data[[f]] == category] = other_category
      }
    }
  }
}

feature_cutoff = nrow(train_data) * 0.95
for (f in features) {
  if (is.factor(train_data[[f]])) {
    if (length(levels(train_data[[f]])) == 1) {
      train_data = train_data[,!(names(train_data) == f)]
    } else {
      for (category in levels(train_data[[f]])) {
        samples_in_cat = sum(na.omit(train_data[[f]]) == category)
        if (samples_in_cat > feature_cutoff) {
          train_data = train_data[,!(names(train_data) == f)]
        }
      }
    }
  }
}

Data Exploration

Next we quiclky inspect the correlations between the numeric predictors in our dataset to determine which might be useful in predicting SalePrice and which might have collinearity issues:

numeric_variables <- train_data[which(sapply(train_data, is.numeric))]
numeric_variables <- subset(numeric_variables, select = -c(Id))
correlations <- cor(numeric_variables)
corrplot(correlations, method="square", order ="FPC")

Next we visualize the numeric predictors that are most correlated with SalePrice:

correlated_indices = abs(correlations['SalePrice',]) > 0.2
correlated_predictors = names(correlations['SalePrice',correlated_indices])
par(mfrow = c(ceiling(length(correlated_predictors) / 5.0), 5), mar=c(1,1,1,1), oma = c(0, 0, 3, 0))
for (cor_pred in correlated_predictors) {
  plot(train_data[[cor_pred]], train_data$SalePrice, 
       main=cor_pred, col=rgb(0.1,0.1,0.1,0.3), pch=16, 
       xaxt='n', yaxt='n', xlab = NA, ylab  = NA)
}
mtext("Predictors' Relationships to SalePrice", outer=TRUE, cex = 2)

Model Evaluation

As an evaluation metric, the Kaggle competition uses Root-Mean-Squared-Error (RMSE) on the logorithm of the predicted and actual sale prices. Taking the logorithm ensures that expensive houses don’t have more of an impact on RMSE than cheap houses.

We’ll use this same metric as our primary quality indicator. To ensure we don’t overfit, we’ll apply it on a 5-fold cross validation set. We’ll also keep our eye on BIC, Adjusted \(R^2\) and a few other metrics.

get_bp_decision = function(model, alpha = 0.05) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_shapiro_decision = function(model, alpha = 0.05) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_log_rmse = function(observed, predicted) {
  sqrt(mean((log(predicted) - log(observed)) ^ 2))
}

get_cv_log_rmse = function(model_formula, data, folds = 5) {
  total_rmse = 0
  n = nrow(data)
  for (i in 1:folds) {
    idx = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.8, 0.2))
    training_set = data[idx,]
    cv_set = data[!idx,]
    model = cached_train(model_formula, data=training_set, method="glm")
    cv_predicted = predict(model, newdata = cv_set)
    # For stability, ignore all predictions are < 1.0
    nan_indices = is.nan(log(cv_predicted))
    total_rmse = total_rmse + get_log_rmse(cv_set$SalePrice[!nan_indices], cv_predicted[!nan_indices])
  }
  total_rmse / folds
}

evaluate_model = function(name, model, data) {
  set.seed(42)
  data.frame(Model = name,
             CV.Log.RMSE = get_cv_log_rmse(model_formula = model$terms,data =data), 
             Adj.R2 = rsq(model, adj=TRUE),
             BIC = BIC(model),
             Coefficients = length(coef(model)),
             Shapiro.Decision = get_shapiro_decision(model),
             BP.Decision = get_bp_decision(model)
             )
}

Model Building

We can see that our response variable SalePrice has a righ skew. A log transform can be used to adjust and normalize the distribution.

par(mfrow=c(1,2))
hist(train_data$SalePrice,main = "SalesPrice")
hist(log(train_data$SalePrice),main = "log(SalesPrice)")

Baseline Models

# Start with a simple, naive model using all predictors added together 
additive_large = lm(SalePrice ~ ., data = train_data)
additive_large_eval = evaluate_model("Additive Large", additive_large, train_data)
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/4d90f81082c9b2e7a354fc9e9924cc297183c69d.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/04e5210fafb9c5787da83779a7de084fd4665ef3.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/88d7535c6a5a3cb6d67ef0b67a9127e5df863768.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/ebf8a95246160dbd6028471f8bf8227ab90bd3f5.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/8d418116eed341487ec6a0b64301091573d9df0e.rds"
kable(additive_large_eval)
Model CV.Log.RMSE Adj.R2 BIC Coefficients Shapiro.Decision BP.Decision
Additive Large 0.1529607 0.880485 35015.8 172 Reject Reject
# Variables were selected based on the results of the Data Exploration section above
additive_small = lm(SalePrice ~ 
                      LotArea + X1stFlrSF + X2ndFlrSF + MasVnrArea + BedroomAbvGr + OverallQual + 
                      OverallCond  + ExterQual + KitchenQual + BsmtQual + BsmtExposure + ScreenPorch + 
                      Fireplaces + MSZoning + YearBuilt, data = train_data)
additive_small_eval = evaluate_model("Additive Small", additive_small,train_data)
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/4d90f81082c9b2e7a354fc9e9924cc297183c69d.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/04e5210fafb9c5787da83779a7de084fd4665ef3.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/88d7535c6a5a3cb6d67ef0b67a9127e5df863768.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/ebf8a95246160dbd6028471f8bf8227ab90bd3f5.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/8d418116eed341487ec6a0b64301091573d9df0e.rds"
kable(additive_small_eval)
Model CV.Log.RMSE Adj.R2 BIC Coefficients Shapiro.Decision BP.Decision
Additive Small 0.1529607 0.8266707 34694.53 25 Reject Reject

We can do much better on our target metric (CV.Log.RMSE) with the smaller model. Notice that the small additive model has far fewer parameters than the large one (25 vs 172) It seems that with so many predictors, it’s very easy to overfit to the training data.

Parameter Search using BIC

The small baseline model also has a better BIC. So it looks like BIC might be a good metric to use to avoid overfitting. Next, we’ll use BIC to do a backwards parameter search starting from additive_large model.

additive_bic_backward = cached_step(additive_large, k = log(nrow(train_data)), direction = 'backward', trace = 0)
## [1] "[cache] loaded step model from  ~/git/STAT420/final-project/cache/cdf38ed02d9e6f66ee157bbdc594ab4e22a3669d.rds"
additive_bic_backward_eval = evaluate_model("Additive Backward BIC", additive_bic_backward,train_data)
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/4d90f81082c9b2e7a354fc9e9924cc297183c69d.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/04e5210fafb9c5787da83779a7de084fd4665ef3.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/88d7535c6a5a3cb6d67ef0b67a9127e5df863768.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/ebf8a95246160dbd6028471f8bf8227ab90bd3f5.rds"
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/8d418116eed341487ec6a0b64301091573d9df0e.rds"
kable(additive_bic_backward_eval)
Model CV.Log.RMSE Adj.R2 BIC Coefficients Shapiro.Decision BP.Decision
Additive Backward BIC 0.1529607 0.8720961 34394.81 48 Reject Reject

Transformations

Doing a parameter search using backwards BIC managed improve our results, but it still fails the Shapiro-Wilks and Breusch-Pagan tests. This suggests we might be able to achieve better results by applying some transformations to our the response and/or predictors.

Based on the graphs in the Data Exploration section above, it appears that many predictors have an exponential or polynomial relationship with SalePrice.

TODO: log transform etc…

Log Transform Continuous Variables

numeric_variables <- names(which(sapply(train_data, is.numeric)))
categorical_variables <- setdiff(names(train_data),(numeric_variables))
# remove Id
numeric_variables <- numeric_variables[numeric_variables!= "Id" & numeric_variables!= "SalePrice"]
numeric_variables_data <- subset(train_data[numeric_variables])

par(mfrow=c(2,2))
tmp<-sapply(numeric_variables,function(var){
  hist(train_data[[var]],main = var)
  hist(log(train_data[[var]]),main = paste0("log(",var,")"))
})
## Warning in log(train_data[[var]]): NaNs produced

should_log_transform<-c('MSSubClass','LotArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','X1stFlrSF','X2ndFlrSF','LowQualFinSF','GrLivArea','BedroomAbvGr','TotRmsAbvGrd','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','X3SsnPorch','ScreenPorch','PoolArea','MiscVal','MoSold')
new_train_data<-train_data
# apply log transforms to predictors
tmp<-sapply(should_log_transform,function(var){
  new_train_data[[var]] <<- log(1+ new_train_data[[var]]  )
})



all_factor_column_names<-build_column_factor_map(d = new_train_data)
model_additive = lm(log(SalePrice) ~ . -Id-MSSubClass, data = new_train_data)
pvals<-coef(summary(model_additive))[,"Pr(>|t|)"]
# get pvals less than .001 and make a new formula
pval_cols<-data.frame(var=names(pvals[pvals<0.001]))
pval_cols_used<-unique(merge(all_factor_column_names,pval_cols,by.x = "var",by.y = "var")$val)
pval_formula<-as.formula(paste("log(SalePrice)~", paste(pval_cols_used, collapse="+")))

influential_indexes <- as.vector(which(cooks.distance(model_additive) > (4 / length(cooks.distance(model_additive)))))
influential_indexes <- c(influential_indexes,as.vector(which(abs(resid(model_additive))>.5,0)))
pval_model<-lm(pval_formula, data=new_train_data[-influential_indexes])
pval_model_eval<-evaluate_model("pval",pval_model,new_train_data[-influential_indexes])
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/ecf1c165c5e719ef6195f3a6195941513976822e.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/4a983b37014c9580f170dd9971703909d0a5df25.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/93692f61d746cb98b3cd4ee1e1cb6f7b4d661e2f.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from  ~/git/STAT420/final-project/cache/9f96e94b8becfc8528f93f50d487843121b7f2c0.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
p <- predict(pval_model, newdata = new_train_data[-influential_indexes])
## Warning in predict.lm(pval_model, newdata = new_train_data[-
## influential_indexes]): prediction from a rank-deficient fit may be
## misleading
plot(exp(p), 
     new_train_data$SalePrice, 
     col = "dodgerblue", 
     pch = 20,
     main = "Prediction vs Actual",
     xlab = "Prediction",
     ylab = "Actual"
     )

# Results

model_results = rbind(additive_large_eval, additive_small_eval, additive_bic_backward_eval,pval_model_eval)
kable(model_results)
Model CV.Log.RMSE Adj.R2 BIC Coefficients Shapiro.Decision BP.Decision
Additive Large 0.1529607 0.8804850 35015.803 172 Reject Reject
Additive Small 0.1529607 0.8266707 34694.528 25 Reject Reject
Additive Backward BIC 0.1529607 0.8720961 34394.805 48 Reject Reject
pval 9.5472158 0.8472417 -1102.703 28 Reject Reject
best_model = additive_bic_backward # TODO: replace with model that's actually the best
# Fitted vs. Residuals Plot
plot(fitted(best_model), resid(best_model), col = 'grey',
     xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals")
    abline(h = 0, col = 'orange', lwd = 2)

# Q-Q Plot
par(mfrow=(c(1,1)))
qqnorm(resid(pval_model), main = "Normal Q-Q Plot", col = 'grey')
qqline(resid(pval_model), col = 'dodgerblue', lwd = 2)

Discussion

Lets summarize what we did, we started with a dataset with a number of invalid values and a large number of predictors. Then we started with the largest possible additive model and reduced number of predictors using backward selection and AIC. We tested resulting model for LINE assumptions for linear regression and found that those assumptions are violated. Then we derived another model using predictor and response tranformation and train it on a dataset that excludes infuential data points. We tested this new model for LINE assumptions and found it was in line with these assumptions, finally we used test dataset for prediction and calculated train and test RMSE. Of course this model is still not the best possible model but can be used for some prediction and can be further iproved by applying further transformations. The final selected model uses both numerical and categorical predictors. Some of the notable predictors that become part of selected model are overall quality, year bulit, ground living area, overall contidions, garage area,lot area etc.. Clearly these all play a major role for determining sales price of a home. So we can conclude that selected model is in line with the practical considerations of housing market. We also tried using interactive predictiors but with such a large number of categorial predictors, the number of predictors were going very high and keeping R studio busy for hours. In future we would like to run R studio in more powerfull machine and explore areas of multithreaded processing to speed up training the model and submit result on Kaggle competition to see the result.

Challenges

The dataset we chose was challenging for a variety of reasons, we think that because there are so many columns and so few rows we can easily run into overfitting - so perhaps in the future we would be able to expand the number of observations. We have also seen that some of the AIC/BIC/train functions may take extremely long times - so caching was added so that we would be able to perform iterative updates to the formatting and text.

Future Work

Acknowledgments

We would like to thank Professor David Unger for his input has been invaluable during the course and lecture hours and also Professor David Dalpiaz for the material that was covered in the course.